Unsupervised Learning of a Chinese Spontaneous and Colloquial Speech Lexicon with Content and Filler Phrase Classification

Authors

  • Chi Shun Cheung
  • Pascale Fung
Abstract

There is a significant lexical difference, in both words and word usage, between spontaneous/colloquial language and written language. This difference affects the performance of spoken language recognition systems that use statistical language models or context-free grammars, because these models are based on the written language rather than the spoken form. Many filler phrases and colloquial phrases appear solely or more often in spontaneous and colloquial speech. Chinese languages perhaps exemplify this difference, as many colloquial forms of the language, such as Cantonese, exist strictly in spoken form and differ from written standard Chinese, which is based on Mandarin. A conventional way of dealing with this issue is to add colloquial terms manually to the lexicon. However, this is time-consuming and expensive. Meanwhile, supervised learning requires manual tagging of large corpora, which is also time-consuming. We propose an unsupervised learning method to find colloquial terms and classify filler and content phrases in spontaneous and colloquial Chinese, including Cantonese. We propose using frequency, strength, and spread measures of character pairs and groups to automatically extract frequent, out-of-vocabulary colloquial terms to add to a standard Chinese lexicon. An unsegmented and unannotated corpus is then segmented with the augmented lexicon. We then propose a Markov classifier to classify Chinese characters into either content or filler phrases in an iterative training method. This method is task-independent and can extract even mixed-language terms. We show the effectiveness of our method on both a natural language query processing task and an adaptive Cantonese language-modeling task. The precision for content phrase extraction and classification is around 80%, with a recall of 99%, and the precision for filler phrase extraction and classification is around 99.5%, with a recall of approximately 89%.
The web search precision using these extracted content words is comparable to that of search results with content phrases selected by humans. We adapt a language model trained on written texts using the Hong Kong Newsgroup corpus. The adapted model outperforms both the standard Chinese language model and the Cantonese language model. It also performs better than a language model trained simply by concatenating the two sets of standard and colloquial texts.
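
The term-extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `candidate_pairs`, the thresholds `min_freq` and `min_spread`, and the simplified definitions used here (frequency as the raw count of an adjacent character pair, spread as the number of distinct utterances containing it) are assumptions chosen only for illustration. The abstract does not define its measures, and the strength measure is omitted entirely.

```python
from collections import Counter


def candidate_pairs(utterances, lexicon, min_freq=2, min_spread=2):
    """Return out-of-vocabulary character pairs that are frequent and widespread.

    Hypothetical simplification: 'frequency' is the total count of a pair,
    'spread' is the number of distinct utterances in which it occurs.
    """
    freq = Counter()    # total occurrences of each adjacent character pair
    spread = Counter()  # number of utterances containing each pair
    for utt in utterances:
        pairs = [utt[i:i + 2] for i in range(len(utt) - 1)]
        freq.update(pairs)
        spread.update(set(pairs))  # count each pair once per utterance
    return sorted(
        pair
        for pair in freq
        if pair not in lexicon
        and freq[pair] >= min_freq
        and spread[pair] >= min_spread
    )


# Toy unsegmented corpus; "你好" is assumed to be in the standard lexicon already.
corpus = ["唔該晒", "唔該你", "你好"]
print(candidate_pairs(corpus, lexicon={"你好"}))  # ['唔該']
```

In this toy corpus only the Cantonese pair 唔該 passes both thresholds, so it would be the one candidate proposed for addition to the lexicon before segmentation.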

Similar resources

Understanding Chinese Spontaneous Speech - Are Mandarin and Cantonese Very Different?

This paper presents a study of the similarity between Cantonese and Mandarin spoken and written texts. Spontaneous speech in Cantonese consists of colloquial and filler phrases, but its keywords are similar to those of Mandarin. We use a statistical tool to extract Cantonese phrases from a spontaneous speech database that we collected using a Wizard-of-Oz setup. More fillers are collected from written Cantonese...

Unsupervised Learning of Non-Uniform Segmental Units for Acoustic Modeling in Speech Recognition

Great progress has been made in the development of recognition systems for continuous read speech but the performance of these systems degrades severely when they are applied to spontaneous speech. This indicates that a different approach in modeling is required to design a system that is better suited to spontaneous speech. Our approach is to combine two advances proposed in previous work: the...

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique given a large amount of labeled data with similar attributes in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where it is expensive and time-consuming to obtain adequate labeled data. We propose a novel method named DALRRL, which consists of deep ...

Code-Copying in the Balochi Language of Sistan

This empirical study deals with language contact phenomena in Sistan. Code-copying is viewed as a strategy of linguistic behavior when a dominated language acquires new elements in lexicon, phonology, morphology, syntax, pragmatic organization, etc., which can be interpreted as copies of a dominating language. In this framework Persian is regarded as the model code which provides elements for b...

On-line learning of acoustic and lexical units for domain-independent ASR

We are interested in on-line acquisition of acoustic, lexical and semantic units from spontaneous speech. Traditional ASR techniques require domain-specific knowledge of acoustic and lexicon data and, more importantly, the word probability distributions. In this paper we propose an algorithm for unsupervised learning of acoustic and lexical units from out-of-domain speech data. The new lexical un...

Journal title:
  • I. J. Speech Technology

Volume 7  Issue 

Pages  -

Publication date: 2004